Conversation

ggerganov (Member)

fix #16590

The server_tokens::keep_first() logic did not correctly handle the case where we keep all tokens up to the end of an image.
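
Roughly, the boundary case looks like this — a minimal standalone sketch (not the actual llama.cpp code), assuming images are stored as runs of placeholder tokens analogous to LLAMA_TOKEN_NULL:

```python
# Minimal sketch (not the actual llama.cpp code) of the boundary case that
# server_tokens::keep_first() mishandled. An image is modeled as a run of
# placeholder tokens (None here, analogous to LLAMA_TOKEN_NULL); truncation
# must be rejected inside an image but allowed exactly at its end.

def can_keep_first(tokens, n):
    """Return True if tokens may be truncated to their first n entries,
    i.e. the cut does not fall strictly inside a single image."""
    assert 0 <= n <= len(tokens)
    if n == 0 or n == len(tokens):
        return True
    # The cut is inside an image only if BOTH the last kept token and the
    # first removed token are image placeholders. The buggy logic rejected
    # the cut whenever the last kept token was a placeholder, which also
    # covered the legal "keep everything up to the end of the image" case.
    # (With adjacent images, the real code must consult chunk boundaries
    # rather than neighboring token ids; this sketch assumes one image.)
    return not (tokens[n - 1] is None and tokens[n] is None)

# 2 text tokens, a 3-cell image, 1 text token
tokens = [101, 102, None, None, None, 103]
for n in range(len(tokens) + 1):
    print(f"keep_first({n}): {'ok' if can_keep_first(tokens, n) else 'rejected'}")
# expected: n = 3, 4 rejected (mid-image); n = 5 allowed — the exact
# "end of an image" boundary this fix addresses
```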

@ngxson ngxson (Collaborator) left a comment

that works, thanks!

@ngxson ngxson merged commit 554fd57 into master Oct 15, 2025
66 of 67 checks passed

ngxson commented Oct 15, 2025

@ggerganov this works well if I regenerate the response for the same prompt, but I'm facing another problem: if I start a new conversation, the server crashes.

To reproduce the problem:

  • Start the server
  • Create the first conversation by adding an image, then ask it "what do you see"
  • Create the second conversation by adding another image, then ask it "tell me about this"
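
For reference, the steps above amount to two independent requests against the server's OpenAI-compatible endpoint (seen as POST /v1/chat/completions in the log below). A hypothetical reproduction sketch — the port, image paths, and the data-URI payload shape are assumptions, not taken from this thread:

```python
# Hypothetical reproduction sketch for the two-conversation crash; assumes a
# llama-server with a multimodal model listening on localhost:8080. The image
# paths and prompts are placeholders matching the steps above.
import base64
import json
import urllib.request

def chat(prompt, image_path):
    # each call is an independent conversation: no shared message history
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    payload = {
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
            ],
        }],
    }
    req = urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

chat("what do you see", "image1.jpg")     # first conversation: works
chat("tell me about this", "image2.jpg")  # second conversation: server crashed
```

The server log from this sequence: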
main: server is listening on http://0.0.0.0:8080 - starting the main loop
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET / 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  update_slots: all slots are idle
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
slot launch_slot_: id  0 | task 1 | processing task
slot update_slots: id  0 | task 1 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 1028
slot update_slots: id  0 | task 1 | n_past = 0, memory_seq_rm [0, end)
slot update_slots: id  0 | task 1 | prompt processing progress, n_past = 15, n_tokens = 15, progress = 0.014591
slot update_slots: id  0 | task 1 | n_past = 15, memory_seq_rm [15, end)
srv  process_chun: processing image...
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  process_chun: image processed in 33718 ms
slot update_slots: id  0 | task 1 | prompt processing progress, n_past = 1028, n_tokens = 5, progress = 1.000000
slot update_slots: id  0 | task 1 | prompt done, n_past = 1028, n_tokens = 5
slot update_slots: id  0 | task 1 | created context checkpoint 1 of 8 (pos_min = 1022, pos_max = 1022, size = 0.078 MiB)
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /slots 192.168.20.1 200
srv  log_server_r: request: GET /slots 192.168.20.1 200
slot print_timing: id  0 | task 1 | 
prompt eval time =   33875.31 ms /  1028 tokens (   32.95 ms per token,    30.35 tokens per second)
       eval time =    3535.76 ms /   219 tokens (   16.15 ms per token,    61.94 tokens per second)
      total time =   37411.07 ms /  1247 tokens
slot      release: id  0 | task 1 | stop processing: n_past = 1246, truncated = 0
srv  update_slots: all slots are idle
srv  log_server_r: request: POST /v1/chat/completions 192.168.20.1 200
srv  log_server_r: request: GET /health 127.0.0.1 200
srv  log_server_r: request: GET /props 192.168.20.1 200
srv  params_from_: Chat format: Content-only
slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = 6500810608735
slot launch_slot_: id  0 | task 228 | processing task
slot update_slots: id  0 | task 228 | new prompt, n_ctx_slot = 4096, n_keep = 0, n_prompt_tokens = 270
libggml-base.so(+0x183cb)[0x7c07727973cb]
libggml-base.so(ggml_print_backtrace+0x21f)[0x7c077279782f]
libggml-base.so(+0x2b20f)[0x7c07727aa20f]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae20c)[0x7c07725ff20c]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae277)[0x7c07725ff277]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xae4d8)[0x7c07725ff4d8]
/lib/x86_64-linux-gnu/libstdc++.so.6(+0xa54cd)[0x7c07725f64cd]
libllama.so(+0x193c16)[0x7c07729c5c16]
libllama.so(_ZNK11llama_vocab4impl14token_to_pieceEiPciib+0x55)[0x7c07729c8335]
/app/llama-server(+0x1f2559)[0x59664514a559]
/app/llama-server(+0x1f2626)[0x59664514a626]
/app/llama-server(+0xf5f72)[0x59664504df72]
/app/llama-server(+0x9594f)[0x596644fed94f]
/app/llama-server(+0x552b3)[0x596644fad2b3]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90)[0x7c077224ad90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80)[0x7c077224ae40]
/app/llama-server(+0x56d35)[0x596644faed35]
terminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 18446744073709551615) >= this->size() (which is 65536)
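
The out-of-range index is telling: 18446744073709551615 is 2^64 - 1, i.e. -1 wrapped around to an unsigned size_t, and the llama_vocab::impl::token_to_piece frame suggests the server tried to detokenize an image placeholder token (LLAMA_TOKEN_NULL, defined as -1) as if it were a text token — a reading consistent with the follow-up fix "server : fix img token logs" (#16595). A quick arithmetic check of the wraparound:

```python
# -1 reinterpreted as a 64-bit unsigned index, matching the value in the
# out_of_range message above
assert (-1) % 2**64 == 18446744073709551615
print(hex((-1) % 2**64))  # 0xffffffffffffffff
```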

@ggerganov ggerganov deleted the gg/server-fix-mtmd-checkpoints branch October 15, 2025 13:40
ggerganov (Member, Author)

Does this fix it? #16595

yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 15, 2025
* origin/master:
Add server-driven parameter defaults and syncing (ggml-org#16515)
metal: optimise `GGML_OP_SUM` (ggml-org#16559)
server : fix img token logs (ggml-org#16595)
llama-quant: add support for mmproj (ggml-org#16592)
CUDA: Changing the CUDA scheduling strategy to spin (ggml-org#16585)
server : fix mtmd checkpoints (ggml-org#16591)
metal : avoid using Metal's gpuAddress property (ggml-org#16576)
vulkan: Add ACC_TYPE_VEC2 implementation (ggml-org#16203)
CUDA + openCL: fix bug in accessing rms_norm->src while doing fusion (ggml-org#16577)
vulkan: Support FA with K/V in F32 (ggml-org#16543)
vulkan: Improve build time for MSVC (ggml-org#16545)
CUDA: enable FA for FP32 KV cache (ggml-org#16546)
CUDA: use fastdiv + ggml_cuda_mad for mmvf (ggml-org#16557)
CUDA: add fp kernel for larger batch size MoE (ggml-org#16512)
cuda : remove legacy copy-op pointer indirection code (ggml-org#16485)
server : dynamic token limit for prompt cache (ggml-org#16560)
Successfully merging this pull request may close these issues:

Misc. bug: Server: LFM2-VL crashes on checkpoint restoring with image input